# Cross-modal Understanding
## InternVL3 78B HF
License: Other · Task: Image-to-Text · Tags: Transformers, Other · Org: OpenGVLab · Downloads: 40 · Likes: 1

InternVL3 is an advanced multimodal large language model series with powerful multimodal perception and reasoning capabilities, supporting image, video, and text inputs.
## Cephalo Gemma 3 4B IT (04-16-2025)
Task: Image-to-Text · Tags: Transformers · Org: lamm-mit · Downloads: 17 · Likes: 1

Cephalo-Gemma-3-4b is a vision-language model specialized in biomaterials and spider-silk analysis, fine-tuned from the Gemma architecture.
## Qwen2.5 Omni 7B
License: Other · Task: Multimodal Fusion · Tags: Transformers, English · Org: Qwen · Downloads: 206.20k · Likes: 1,522

Qwen2.5-Omni is an end-to-end multimodal model that perceives text, images, audio, and video, and generates text and natural speech responses in a streaming manner.
## Centurio Aya
Task: Image-to-Text · Tags: Transformers, Supports Multiple Languages · Org: WueNLP · Downloads: 29 · Likes: 4

Centurio is an open-source multilingual large vision-language model supporting 100 languages, capable of both image-to-text and text-to-text tasks.
## ThaiCapGen CLIP-GPT2
Task: Image-to-Text · Tags: Other · Org: Natthaphon · Downloads: 18 · Likes: 0

An encoder-decoder model that pairs a CLIP encoder with a GPT-2 decoder to generate Thai image descriptions.
## Chameleon 30B
License: Other · Task: Multimodal Fusion · Tags: Transformers · Org: facebook · Downloads: 102 · Likes: 86

Meta Chameleon is a mixed-modal early-fusion foundation model developed by FAIR, supporting joint processing of images and text.
## Final Model
License: Apache-2.0 · Task: Text Recognition · Tags: Transformers · Org: goatrider · Downloads: 17 · Likes: 0

An Apache-2.0-licensed image-to-text model that converts image content into textual descriptions.
## BLIP Image Captioning Large
License: BSD-3-Clause · Task: Image-to-Text · Tags: Transformers · Org: movementso · Downloads: 18 · Likes: 0

BLIP is a unified vision-language pretraining framework that excels at image caption generation and understanding, making efficient use of web data through a caption-bootstrapping strategy.
## General Image Captioning
License: Apache-2.0 · Task: Text Recognition · Tags: Transformers, Other · Org: alibidaran · Downloads: 30 · Likes: 0

An Apache-2.0-licensed image-to-text model that converts image content into textual descriptions.
## CLIP ViT-B/16 DataComp.XL S13B B90K
License: MIT · Task: Text-to-Image · Org: laion · Downloads: 4,461 · Likes: 7

A CLIP ViT-B/16 model trained with OpenCLIP on the DataComp-1B dataset, used primarily for zero-shot image classification and image-text retrieval.
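As a sketch of the zero-shot classification mechanism that CLIP-style models like this one rely on: the image and each candidate text prompt are embedded into a shared space, and the prompt with the highest cosine similarity wins. The toy 4-d vectors below are hypothetical stand-ins for the real encoders (which would require downloading the model and would produce e.g. 512-d embeddings).

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, labels):
    """Return the label whose text embedding is most similar to the image embedding.

    CLIP-style zero-shot classification: L2-normalize both sides, then
    cosine similarity reduces to a plain dot product.
    """
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = txt @ img                      # one cosine similarity per label
    return labels[int(np.argmax(sims))]

# Hypothetical toy embeddings, not outputs of a real model.
labels = ["a photo of a cat", "a photo of a dog"]
text_embs = np.array([[1.0, 0.0, 0.2, 0.0],
                      [0.0, 1.0, 0.0, 0.2]])
image_emb = np.array([0.9, 0.1, 0.3, 0.0])   # closest to the "cat" prompt

print(zero_shot_classify(image_emb, text_embs, labels))  # a photo of a cat
```

The same dot-product ranking also drives image-text retrieval: instead of picking one label per image, similarities are sorted across a whole gallery.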
## Pix2Struct DocVQA Base
License: Apache-2.0 · Task: Image-to-Text · Tags: Transformers, Supports Multiple Languages · Org: google · Downloads: 8,601 · Likes: 37

Pix2Struct is an image-encoder/text-decoder model trained on image-text pairs, supporting tasks such as image captioning and visual question answering.
## MSCOCO-Finetuned CoCa ViT-L/14 Laion2B S13B B90K
License: MIT · Task: Image-to-Text · Org: laion · Downloads: 21.02k · Likes: 20

An MIT-licensed image-to-text model that converts image content into textual descriptions.
## VinVL Base Image Captioning
License: Apache-2.0 · Task: Image-to-Text · Org: michelecafagna26 · Downloads: 45 · Likes: 1

Microsoft's VinVL base pre-trained model, designed specifically for image captioning tasks, with strong vision-language understanding capabilities.
## Chinese CLIP ViT-Large Patch14 336px
Task: Text-to-Image · Tags: Transformers · Org: OFA-Sys · Downloads: 713 · Likes: 23

Chinese CLIP is a simple implementation of CLIP trained on roughly 200 million Chinese image-text pairs, using ViT-L/14@336px as the image encoder and RoBERTa-wwm-base as the text encoder.
## MolT5 Base
License: Apache-2.0 · Task: Machine Translation · Tags: Transformers · Org: laituan245 · Downloads: 3,617 · Likes: 1

molt5-base is a T5-based model designed specifically for translation between molecules and natural language.